Three days ago I was looking through the GTDB-Tk manual and saw that
de_novo_wf was an option for building the trees.
From the description given:
knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")
I believed this would be worth doing, as it might produce
more accurate trees. Sample 1Dt2d (Enterobacter
cancerogenus) had been placed by classify_wf in the
previous GTDB-Tk analysis in the genus Pantoea, which led me to
this search. After a bit of trial and error, I produced this
script.
This ran as a Slurm job on Hawk (SCW) from roughly 20:10 on the 23rd to 01:00 on the 24th, totalling 4 hours and 50 minutes. The main parameters that I experimented with were:
- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10
I settled on these as being the “best”; however, it is entirely possible that they could be optimised further.
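As a rough sketch only (not the script actually submitted), the settings above could be combined along these lines. The genome and output directories, the outgroup taxon and the choice of --bacteria are assumptions made for illustration; de_novo_wf does require an outgroup and a domain flag, but the values used in the real run are not recorded here.

```shell
#!/bin/bash
#SBATCH --ntasks=5
#SBATCH --time=24:00:00
#SBATCH --mem=50g

# Placeholder values, NOT the ones used in the real run.
GENOME_DIR="genomes"               # directory holding the assemblies
OUT_DIR="gtdb_tk_de_novo_sketch"   # where GTDB-Tk writes its results
OUTGROUP="p__Chloroflexota"        # assumed outgroup taxon for rooting
CPUS=10                            # matches the --cpus value listed above

# Only call GTDB-Tk if it is available on this node.
if command -v gtdbtk >/dev/null 2>&1; then
    # GTDB-Tk looks for .fna files by default; add --extension if the
    # assemblies use a different suffix.
    gtdbtk de_novo_wf \
        --genome_dir "$GENOME_DIR" \
        --bacteria \
        --outgroup_taxon "$OUTGROUP" \
        --out_dir "$OUT_DIR" \
        --cpus "$CPUS"
else
    echo "gtdbtk not found on PATH; load the module or conda environment first" >&2
fi
```

A file like this would be submitted with sbatch, and the SBATCH lines adjusted to whatever the cluster allows.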
This analysis produced these files:
/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log
./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv
./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv
./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results
./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree
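A quick sanity check on these outputs is to count how many genomes made it into the user alignment. The sketch below assumes that gtdbtk.bac120.user_msa.fasta.gz holds one FASTA record per submitted genome that passed the identify and align steps, and that it is run from inside the output directory; count_fasta_records is a helper name made up for this example.

```shell
# Count FASTA records (header lines starting with ">") in a gzipped alignment.
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Path taken from the listing above, relative to the output directory.
USER_MSA="align/gtdbtk.bac120.user_msa.fasta.gz"

if [ -f "$USER_MSA" ]; then
    echo "genomes in user MSA: $(count_fasta_records "$USER_MSA")"
else
    echo "no file at $USER_MSA; run this from the GTDB-Tk output directory" >&2
fi
```

If the count is lower than the number of genomes submitted, identify/gtdbtk.failed_genomes.tsv and gtdbtk.warnings.log are the first places to look.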
I then opened this gtdbtk.bac120.decorated.tree file in
Dendroscope for review. All 10 samples are on one tree, but 1Dt2d
is still being placed in the “wrong” genus. I then reviewed its sister
accession on the NCBI database.
The NCBI page for the sister accession shows a CheckM analysis that comes back with:
completeness: 90%
contamination: 3.6%
Taxonomy check status: failed
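Upon viewing the tree in Dendroscope, the joining node has the label
0.968. I believe this to be the support value for the relationship
(the tree is inferred with FastTree, whose node labels are local support
values, so it indicates a well-supported grouping rather than a literal
probability that the relationship is correct). This suggests the two are very
closely related, possibly the same species, and the
online sample is also identified as Enterobacter cancerogenus.
However, given the CheckM results and the failed taxonomy check, I find it plausible that they both
have been misidentified and are in reality Pantoea species; I find this
the most parsimonious explanation. I will follow this up with a CheckM
analysis of my own on 1Dt2d.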
This was a “technical spike”, or proof of concept, for
de_novo_wf.